The Lancet Digital Health — Latest Matching Preprints

1

Aging Signals on Chest Radiographs: Association of Chest Radiograph-Derived Age Acceleration With Future Lung Cancer Incidence

Mitsuyama, Y.; Walston, S. L.; Takita, H.; Saito, K.; Ueda, D.

2026-03-31 radiology and imaging 10.64898/2026.03.30.26349022 medRxiv

Top 0.1%

10.7%

Show abstract

Purpose: To evaluate whether chest radiograph-derived age acceleration is associated with incident lung cancer and whether it improves discrimination beyond established lung cancer risk factors. Materials and Methods: This retrospective analysis used prospectively collected data from the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial. Baseline digitized chest radiographs from the initial screening year were analyzed using a previously validated deep learning model that estimates chest radiograph-derived age (Xp-age). Age acceleration (AgeAccel) was defined as the residual of Xp-age after calibration to chronological age using a regression model from the development dataset. A 1-year landmark design excluded participants diagnosed with lung cancer or censored within 1 year of baseline. Associations with incident lung cancer were assessed using multivariable Cox proportional-hazards models adjusted for prespecified demographic and clinical predictors, including smoking variables used in the PLCOm2012 risk prediction model. Discrimination was evaluated using the concordance index and 6-year time-dependent area under the receiver-operating-characteristic curve. Results: The analytic cohort included 23,213 participants (mean age, 62.5 years); 790 developed incident lung cancer after the landmark (mean follow-up, 16.7 years). Higher AgeAccel was associated with increased lung cancer incidence (hazard ratio, 1.10 per 1-SD increase; 95% confidence interval: 1.03- 1.17); however, addition of AgeAccel to an established risk factor model resulted in minimal change in discrimination (C-index, 0.840 vs. 0.839; time-dependent AUC at 6 years, 0.852 vs. 0.852). Attribution maps emphasized the aortic arch/mediastinal region with similar spatial patterns across smoking and lung cancer strata. Conclusion: Chest radiograph-derived age acceleration was independently associated with future lung cancer incidence.

2

Pneumonia Detection in Paediatric Chest X-Rays using Ensembled Large Language Models

Tan, J.; Tang, P. H.

2026-04-12 radiology and imaging 10.64898/2026.04.10.26347909 medRxiv

Top 0.1%

10.3%

Show abstract

Background: Paediatric pneumonia is a leading cause of childhood morbidity and mortality worldwide. Chest X-rays (CXR) are an important diagnostic tool in the diagnosis of pneumonia, but shortages in specialist radiology services lead to clinically significant delays in CXR reporting. The ability to communicate findings both to clinicians and laypersons allows MLLMs to be deployed throughout clinical workflows, from image analysis to patient communication. However, MLLMs currently underperform state-of-the-art deep learning classifiers. Objective: To evaluate the diagnostic accuracy of ensemble strategies with MLLMs compared to the baseline average agent for paediatric radiological pneumonia detection. Methods: We conducted a retrospective cohort study using paediatric CXRs from two independent hospital datasets totalling 2300 CXRs. Fifteen MedGemma-4B-it agents independently classified each CXR into five pneumonia likelihood categories. Majority voting, soft voting, and GPTOSS-20B aggregation were compared against the average agent performance. The primary metric evaluated was OvR AUROC. Secondary metrics included accuracy, sensitivity, specificity, F1-score, Cohen's kappa, and OvO AUROC. Results: Soft voting achieved improvements in OvR AUROC (p_balanced = 0.0002, p_real-world = 0.0003), accuracy (p_balanced = 0.0008, p_real-world < 0.0001), Cohen's Kappa (p_balanced = 0.0006, p_real-world = 0.0054) and OvO AUROC (p_balanced < 0.0001, p_real-world = 0.0011) across both datasets, and a superior F1-value (pbalanced = 0.0028) for the balanced dataset. Conclusion: Soft voting enhances MedGemma's diagnostic discriminatory performance for paediatric radiological pneumonia detection. Our system enables privacy-preserving, near real-time clinical decision support with explainable outputs, having potential for integration into emergency departments. Our system's high specificity supports triage by flagging high-risk radiological pneumonia cases.

3

Classifying and Differentiating Individuals with Respiratory Syncytial Virus, Influenza, and COVID-19 Cases in OpenSAFELY

Prestige, E.; Warren-Gash, C.; Quint, J. K.; Evans, D.; Costello, R. E.; Mehrkar, A.; Bacon, S.; Goldacre, B.; Barley-McMullen, S.; Yameen, F.; Shah, P.; Natt, M.; Alder, Y.; Hulme, W.; Parker, E. P. K.; Eggo, R. M.

2026-04-13 infectious diseases 10.64898/2026.04.09.26350495 medRxiv

Top 0.1%

6.8%

Show abstract

Electronic health records (EHRs) are a rich source of data which can be used to analyse health outcomes using computable phenotypes. With the approval of NHS England we used the OpenSAFELY secure analytics platform to design and assess phenotypes to classify three key respiratory viruses - respiratory syncytial virus (RSV), influenza, and COVID-19 - in English coded health data between September 2016 and August 2024. We compared specific and sensitive phenotypes to one another and to publicly available surveillance data. Cases from both phenotypes showed similar seasonal patterns to surveillance data. Sensitive phenotypes led to increased risk of misclassification than specific phenotypes for mild cases. For severe cases the risk of misclassification was higher in infants than for older adults, irrespective of the phenotype used. The phenotypes presented here offer a solution to classifying respiratory viruses from coded health records in the absence of testing information.

4

Harmonising UK primary care prescription records for research: A case study in the UK Biobank

Ytsma, C. R.; Torralbo, A.; Fitzpatrick, N. K.; Pietzner, M.; Louloudis, I.; Nguyen, D.; Ansarey, S.; Denaxas, S.

2026-04-22 health informatics 10.64898/2026.04.21.26351274 medRxiv

Top 0.1%

6.3%

Show abstract

Objective The aim of this study was to develop and validate an automated, scalable framework to harmonise fragmented UK primary care prescription records into a research-ready dataset by mapping four diverse medical ontologies to a unified, historically comprehensive reference standard. Materials and Methods We used raw prescription records for consented participants in the UK Biobank, in which participants are uniquely characterized by multiple data modalities. Primary care data were preprocessed by selecting one drug code if multiple were recorded, cleaning codes to match reference presentations, expanding code granularity based on drug descriptions, and updating outdated codes to a single reference version. Harmonisation entailed mapping British National Formulary (BNF) and Read2 codes to dm+d, the universal NHS standard vocabulary for uniquely identifying and prescribing medicines. Harmonised dm+d records were then homogenised to a single concept granularity, the Virtual Medicinal Product (VMP). We validated our methods by creating medication profiles mapping contemporary drug prescribing patterns in 312 physical and mental health conditions. Results We preprocessed 57,659,844 records (100%) from 221,868 participants (100%). Of those, 48,950 records were dropped due to lack of drug code. 7,357,572 records (13%) used multiple ontologies. Most (76%) records were encoded in BNF and most had the code granularity expanded via the drug description (N=28,034,282; 49%). 41,244,315 records (72%) were harmonised to dm+d and 99.98% of these were converted to VMP as a homogeneous dataset. Across 312 diseases, we identified 23,352 disease-drug associations with 237 medications (represented as BNF subparagraphs) that survived statistical correction of which most resembled drug - indication pairs. Conclusion Our methodology converts highly fragmented and raw prescription records with inconsistent data quality into a streamlined, enriched dataset at a single reference, version, and granularity of information. Harmonised prescription records can be easily utilised by researchers to perform large-scale analyses in research.

5

High-Throughput Observational Evidence Generation Using Linked Electronic Health Record and Claims Data

Gombar, S.; Shah, N.; Sanghavi, N.; Coyle, J.; Mukerji, A.; Chappelka, M.

2026-04-07 health informatics 10.64898/2026.04.07.26350300 medRxiv

Top 0.1%

6.3%

Show abstract

Background: The observational literature on comparative effectiveness is expanding rapidly but remains difficult to synthesize. Discordant findings often stem from structural differences in cohort definitions, inclusion criteria, and follow up windows, leaving stakeholders without a cohesive evidence base. Furthermore, studies typically focus on a narrow subset of outcomes, neglecting the broader needs of diverse healthcare stakeholders 1,2,3,4. Methods We developed a high throughput evidence generation workflow using linked EHR and administrative claims data. The cornerstone is a prespecified measurement architecture applied uniformly across clinical scenarios: six post index windows (acute to two year follow.up); 28 Elixhauser comorbidities; 14 healthcare resource utilization (HCRU) categories; 29 laboratory measures with 52 binary thresholds; and 42 adverse event categories. We generated unadjusted treatment comparisons across ~1,038 outcomes per scenario, including effect-measure modification (EMM) assessments across 130 baseline features. Results Across 40 clinical domains, the workflow produced approximately 32,982,552 outcome evaluations. An evaluation included a treatment comparison outcome population effect estimate with uncertainty bounds and supporting diagnostics. Approximately 5,000 narrative summaries underwent structured clinical and statistical quality control before dissemination. Conclusions Standardized, high throughput workflows can shift evidence generation away from fragmented studies toward comprehensive evidence packages. This shared evidence base supports precision medicine by making treatment effect heterogeneity visible across clinically meaningful subpopulations, reducing the need for redundant, stakeholder-specific studies.

6

Trade-offs in Cardiovascular Risk Prediction Using Race and Social Determinants of Health

Hammarlund, N.; Wang, X.; Grant, D.; Purves, D.

2026-04-04 cardiovascular medicine 10.64898/2026.04.02.26350089 medRxiv

Top 0.1%

6.2%

Show abstract

Importance: Health systems are increasingly adopting race-neutral cardiovascular risk prediction tools, yet no study has examined how these choices redistribute preventive treatment at the point of clinical decision-making, particularly for Black individuals who already bear a disproportionate cardiovascular burden. Objective: To evaluate how including race, substituting social determinants of health (SDoH), or excluding both reshapes cardiovascular risk classification, calibration, fairness, and clinical decisions. Design: Retrospective cohort study with repeated cross-validation and integrated decision-focused evaluation, using CARDIA study data with baseline measures from 2010 and cardiovascular outcomes through 2021. Setting: Community-based longitudinal cohort recruited across multiple U.S. cities. Participants: 3,241 Black and White adults without known cardiovascular disease at baseline. Main Outcomes and Measures: Three models predicting 10-year incident cardiovascular disease were compared on predictive performance, calibration, fairness metrics, and realized clinical utility at the ACC/AHA 7.5% preventive treatment threshold. Results: Among 3,241 participants (46% Black, mean age 50 years, 6.9% CVD incidence), overall performance was similar across models (AUC 0.762 to 0.768). Predictor choice substantially reshaped clinical decisions at the guideline threshold. The SDoH-based model improved parity metrics but produced systematic underprediction and concentrated new overtreatment among Black participants. The clinical-only model further improved parity metrics but generated new undertreatment, with four cases of untreated CVD and none avoided. No single evaluative dimension captured the full equity consequences. Conclusions and Relevance: Parity metrics improved under both race-neutral models, yet both produced clinical harms concentrated among Black participants not apparent in population-average metrics. The case for race removal has rested on conceptual grounds, but comprehensive empirical evaluation is necessary before health systems can be confident their model choices truly serve those most at risk.

7

Prognosis of stroke subtypes in whole population health systems data: a matched cohort study

Hosking, A.; Iveson, M. H.; Sherlock, L.; Mukherjee, M.; Grover, C.; Alex, B.; Parepalli, S.; Mair, G.; Doubal, F.; Whalley, H. C.; Tobin, R.; Wardlaw, J. M.; Al-Shahi Salman, R.; Whiteley, W. N.

2026-04-20 neurology 10.64898/2026.04.17.26351150 medRxiv

Top 0.1%

6.1%

Show abstract

Background Outcome after stroke varies according to stroke subtype by location, but healthcare systems data studies do not include subtyping information. We linked natural language processing (NLP) of brain imaging reports to routinely collected data to estimate risk of death and other outcomes after stroke subtypes in a nationwide dataset. Methods We applied a previously validated NLP algorithm to all CT and MRI head scan reports in Scotland between 2010 and 2018. We linked the reports to hospital readmissions, prescriptions and death data to identify and characterize people with stroke, and to categorize into deep and cortical ischemic stroke, deep and lobar intracerebral hemorrhage (ICH), subarachnoid hemorrhage, and subdural hemorrhage. We used a matched cohort design, and age- and sex-matched four controls per case who never had a stroke. By subtype, we estimated rehospitalization with stroke, myocardial infarction (MI), cancer, dementia, epilepsy and death, accounting for confounders and competing risk of death. Results From 785,331 people with a head scan, we identified 64,219 with clinical stroke phenotypes (mean age 73.4yrs, 49.5% male), and subtyped 12,616 with deep ischaemic stroke; 14,103 with cortical ischaemic stroke; 1,814 with deep ICH; and 1,456 with lobar ICH. There was higher absolute rate of 1-year hospital readmission for lobar compared with deep ICH (4.9% [95%CI 3.9% - 6.1%] vs 3.4% [2.6% - 4.3%]), higher risk of dementia beyond 6 months after lobar ICH compared to controls than for other stroke subtypes (aHR 3.5 [2.3-5.3]); and higher risk of MI within 6 months of cortical ischemic stroke than for other stroke subtypes (aHR 4.6 [3.4-6.3]). Conclusions NLP of free-text reports linked to coded data successfully subtyped stroke at scale, and we estimated risk of clinically relevant outcomes. Future work should use free text to enable large-scale audit and epidemiology of people with stroke.

8

The FEES Dysphagia Index: a bias-resilient continuous score that captures expert clinical judgment in 2,943 neurological inpatients

Werner, C. J.; Sanchez-Garcia, E.; Mall, B.; Meyer, T.; Pinho, J.; Schulz, J. B.; Schumann-Werner, B.

2026-04-21 neurology 10.64898/2026.04.20.26351259 medRxiv

Top 0.1%

4.8%

Show abstract

Multi-consistency testing during flexible endoscopic evaluation of swallowing (FEES) is clinically necessary but introduces selection bias: worst scores inflate severity because the number of consistencies tested covaries with disease severity. In this retrospective observational study of hospitalized neurological patients, we derived and validated the FEES Dysphagia Index (FDI) in two temporally independent cohorts (Cohort 1: 2013-2018, N=1,257; Cohort 2: 2021-2025, N=1,686) from a single center. FDI-S averages Penetration-Aspiration Scale (PAS) scores across tested consistencies (0-100 scale); FDI-E uses Yale Pharyngeal Residue scores; FDI-C combines both. Selection bias was quantified using sequential branching-tree inverse probability weighting (IPW). Worst PAS overestimated severity by 24%; FDI deviated by <2%. FDI-C was significantly superior to Worst PAS for hospital-acquired pneumonia (HAP; AUC 0.70 vs. 0.60, p<0.001), mortality (0.71 vs. 0.62, p=0.040), and restricted oral intake (0.90 vs. 0.74, p<0.001), and statistically equivalent to clinician-rated severity. FDI-C mapped linearly onto ordinal Functional Oral Intake Scale values (FOIS; proportional odds RCS p=0.99). With functional status and diagnosis, FDI-C reconstructed the clinicians oral intake recommendation with AUC up to 0.93. The FDI-C-mortality relationship was sigmoidal with a clinically relevant transition zone between [~]50 and [~]85. FDI-C is a bias-resilient, bedside-calculable score with interval-scale properties that captures expert clinical judgment, suitable as both a clinical decision support tool and a continuous research endpoint.

9

Multi-Task Learning and Soft-Label Supervision for Psychosocial Burden Profiling in Cancer Peer-Support Text

Wang, Z.; Cao, Y.; Shen, X.; Ding, Z.; Liu, Y.; Zhang, Y.

2026-04-04 health informatics 10.64898/2026.04.03.26350034 medRxiv

Top 0.1%

4.8%

Show abstract

Objective: Online cancer peer-support text contains signals of psychosocial burden beyond emotional tone, including treatment burden, financial strain, uncertainty, and unmet support needs. We evaluated 2 modeling extensions: multi-task learning (MTL) for joint prediction of health economics and outcomes research (HEOR) burden dimensions, and soft-label supervision using large language model (LLM)-derived probability distributions. Materials and Methods: We analyzed 10,392 cancer peer-support posts. GPT-4o-mini generated proxy annotations for HEOR burden subscales, composite burden, high-need status, speaker role, cancer type, and emotion probabilities. Study 1 trained a shared ALBERT encoder under 4 MTL conditions: composite and subscale burden targets, each with and without auxiliary heads, using Kendall uncertainty weighting. Study 2 compared soft-label training on LLM emotion distributions with hard-label baselines under regular and token-augmented inputs, evaluating performance against both human labels and AI distributions. Results: Composite-only MTL achieved R2=0.446 for burden regression and weighted F1=0.810 for high-need screening; subscale classification achieved mean weighted F1=0.646. Adding auxiliary role and cancer-type heads reduced regression performance ({triangleup}R2 = -0.209). Soft-label training reduced weighted F1 by 0.16 versus hard-label baselines (0.68 vs. 0.86), and token augmentation did not improve performance under soft supervision. Discussion: Composite-only MTL supported modeling of multidimensional burden-related signals from forum text, whereas auxiliary prediction heads appeared to compete with primary tasks. Soft-label training aligned poorly with human-labeled emotion categories, suggesting that uncalibrated LLM distributions may propagate bias rather than improve supervision. Conclusion: Composite-only MTL was the strongest burden-modeling approach, and hard-label supervision remained preferable for emotion classification.

10

Language-Related Differences in Prenatal Depression Screening Uptake, US Midwest 2019-2024

Luff, A.; Rivelli, A.; Akaninyene, N.; Malloy, E.; Mishra, R.; Fitzpatrick, V.

2026-04-08 obstetrics and gynecology 10.64898/2026.04.07.26350332 medRxiv

Top 0.1%

4.4%

Show abstract

Prenatal depression is a substantial contributor to maternal morbidity, and screening is an entry point to psychiatric assessment and treatment during pregnancy. Following updated guidelines and quality metrics for prenatal depression screening, we evaluated whether screening uptake differed by preferred language within a large U.S. healthcare system. We used electronic health record data to identify a retrospective cohort of deliveries at or beyond 20 weeks gestation in 2019-2024. We used logistic regression with a language-year interaction to estimate the adjusted marginal probabilities of screening by language preference. Among 99,526 pregnancies (82,632 individuals), screening increased substantially over time but increases differed across language groups (p<0.001). In 2019, screening probabilities were similar (English 0.50; Spanish 0.48; Another Language 0.50). By 2024, probabilities diverged (English 0.81; Spanish 0.66; Another Language 0.71). Unequal screening uptake can systematically under-identify prenatal depression among patients with non-English language preference, with implications for equitable access to psychiatric care.

11

Data Resource Profile: EST-Health-30

Reisberg, S.; Oja, M.; Mooses, K.; Tamm, S.; Sild, A.; Talvik, H.-A.; Laur, S.; Kolde, R.; Vilo, J.

2026-04-24 epidemiology 10.64898/2026.04.21.26351087 medRxiv

Top 0.1%

3.7%

Show abstract

Background: The increasing availability of routinely collected health data offers new opportunities for population-level research, yet access to comprehensive, linked, and standardised datasets remains limited. We describe EST-Health-30, a large-scale, population-representative health data resource from Estonia. Methods: EST-Health-30 comprises a random 30% sample of the Estonian population (~500,000 individuals), with longitudinal data from 2012 to 2024 and annual updates planned through 2026. Individual-level records are linked across five nationwide databases, including electronic health records, health insurance claims, prescription data, cancer registry, and cause of death records. A privacy-preserving hashing approach ensures consistent cohort inclusion over time while maintaining pseudonymisation. All data are harmonised to the Observational Medical Outcomes Partnership (OMOP) Common Data Model (version 5.4) using international standard vocabularies. Data quality was assessed using established OMOP-based validation frameworks. Results: The dataset contains rich multimodal information on diagnoses, procedures, laboratory measurements, prescriptions, free-text clinical notes, healthcare utilisation, and costs, with high population coverage and longitudinal depth. Data quality assessment showed high completeness and consistency, with 99.2% of applicable checks passing. The age-sex distribution closely reflects the national population, supporting representativeness, though coverage is marginally below the target 30% (29.2%), primarily attributable to recent immigrants without health system contact. The dataset enables construction of detailed clinical cohorts, analysis of disease trajectories, and evaluation of healthcare utilisation and outcomes across the life course. Conclusions: EST-Health-30 is a comprehensive, standardised, and population-representative real-world data resource that supports epidemiological, clinical, and methodological research. Its alignment with the OMOP CDM facilitates reproducible analytics and participation in international federated research networks, while secure access infrastructure ensures compliance with data protection regulations.

12

Most Instability Phases Resolve: Empirical Evidence for Trajectory Plasticity in Multimorbidity Care from Longitudinal Relational Monitoring

Martin, C. M.; henderson, i.; Campbell, D.; Stockman, K.

2026-04-24 health informatics 10.64898/2026.04.22.26351537 medRxiv

Top 0.1%

3.6%

Show abstract

Background: The instability-plasticity framework proposes that multimorbidity trajectories periodically enter instability phases that are vulnerable to escalation but also potentially modifiable through relational intervention. Whether such phases commonly resolve without acute care, or predominantly progress to hospitalisation, has not been quantified at scale. Objective: To quantify instability window outcomes across a longitudinal monitoring cohort; to test whether the characteristics distinguishing admitted from resolved windows reflect within-patient trajectory dynamics or between-patient severity; and to characterise which patient-reported and operator-rated signals reliably precede admission, using both a curated pilot sub-cohort and the full monitoring cohort with an explicit cross-cohort comparison. Methods: Two complementary analyses were conducted on data from the MonashWatch Patient Journey Record (PaJR) relational telehealth system. Instability windows were identified algorithmically (>=2 consecutive calls with Total_Alerts >=3) across the full longitudinal dataset (16,383 calls, 244 patients, 2.5 years) and classified by linkage to ED and hospital admission data. Window characteristics were compared at window, patient, and paired within-patient levels. Pre-admission signal cascades were analysed in two configurations: a curated pilot sub-cohort (64 patients, 280 calls, +/-10-day window, 103 admissions, December 2016-September 2017) and the full monitoring cohort (175 patients, 1,180 pre-admission calls, +/-14-day window, December 2016-July 2019). A three-way cross-cohort comparison decomposed differences between the two configurations into pipeline and population effects. Results: 621 instability windows were identified across 157 patients (64% of the monitored cohort). 67.3% resolved without hospital admission or ED attendance, a rate stable across alert thresholds 1-5. In paired within-patient analysis (n = 70), duration in days (p = 0.002) and multi-domain breadth (p < 0.001) distinguished admitted from resolved windows; alert intensity did not. In the pilot sub-cohort, patient-reported illness prognosis (Q21) was the dominant pre-admission signal (GEE beta = +0.058, AUC = 0.647, p-BH = 0.018). This finding did not replicate in the full cohort: Q21 was non-significant (GEE beta = -0.008, p = 0.154, AUC = 0.507). Cross-cohort analysis identified selective curation of the pilot sub-cohort as the primary explanation. In the full cohort, six signals escalated significantly before admission after Benjamini-Hochberg correction: total alerts, health impairment (Q26), red alerts, self-rated health (Q3), patient concerns (Q1), and operator concern (Q34). Health impairment achieved the highest individual AUC (0.605) and showed the longest pre-admission lead. No individual signal exceeded AUC 0.61. Conclusions: Two thirds of instability phases resolve without hospitalisation, providing direct empirical support for trajectory plasticity as a clinically frequent phenomenon. Within the same patient, persistence - in duration and in the consistency of high-severity multi-domain flagging across calls - distinguishes trajectories that tip into admission from those that resolve. The Q21 signal reversal between cohorts illustrates how selective curation can produce compelling but non-replicable findings in monitoring research. In the full population, objective alert signals and operator judgement, rather than patient illness prognosis, carry the pre-admission signal

13

Health Impact Assessment of BRCA1/2 Cascade Screening for the Personalized Prevention of Hereditary Breast and Ovarian Cancers in Italy

Valz Gris, A.; Giacobini, E.; Tricomi, V.; Rumi, F.; Valentini, I.; Cristiano, A.; Testa, S.; Rosano, A.; Pezzullo, A. M.; Boccia, S.

2026-04-15 public and global health 10.64898/2026.04.13.26350758 medRxiv

Top 0.1%

3.5%

Show abstract

Introduction Pathogenic germline variants in the BRCA1 and BRCA2 genes confer a markedly increased risk of breast and ovarian cancer, for which effective preventive strategies are available. Although national and international guidelines recommend BRCA testing and cascade screening of relatives, implementation in Italy remains highly heterogeneous across regions. This study estimates the potential population health and cost impact of achieving full nationwide implementation of BRCA1/2 cascade screening in Italy and identifies key organisational barriers and priority actions for implementation. Methods We conducted a Health Impact Assessment integrating literature review, simulation modelling, and stakeholder consultation. A decision tree and Markov model compared the current heterogeneous implementation of BRCA screening in Italy with an ideal scenario reflecting full adherence to national guidelines, optimal cascade screening, and uptake of preventive strategies. Outcomes included breast and ovarian cancer incidence and mortality, healthcare costs over a lifetime horizon (80 years). Key barriers affecting organisational feasibility, acceptability, and patient well-being were assessed, and a set of priority action recommendations was developed. Results In the ideal scenario, 25,626 eligible cancer patients would undergo BRCA testing annually, identifying 4,254 mutation carriers and enabling cascade testing of 27,650 relatives, of whom 8,682 would be BRCA-positive. Under the current implementation, only 8,807 patients and 2,168 relatives are tested, identifying 948 carriers. Over 30 years, full implementation would prevent 821 cancer cases (- 27.9%) and 1,282 deaths (- 49.7%) compared with the current scenario. While initial expenditures increase due to expanded testing and preventive interventions, cumulative costs decrease over time, resulting in net savings of 5.8 million euros at 30 years and a saving per event avoided (- 2,779 euros). Major implementation barriers include fragmented governance, limited access to genetic counselling, heterogeneous laboratory practices, insufficient professional training, and weak referral pathways. Conclusion Full implementation of BRCA1/2 cascade screening in Italy would yield substantial population health benefits and long-term cost savings. Coordinated national governance, standardised pathways, investment in counselling and workforce capacity, and robust monitoring systems are essential to ensure equitable access and sustainable delivery of personalised cancer prevention. This study demonstrates the value of the HIA methodology for evaluating and guiding genomic prevention policies.

14

Temporal features of the built environment and associations with drowning mortality: A global satellite-based analysis

Essex, R.; Lim, S.; Jagnoor, J.

2026-04-21 public and global health 10.64898/2026.04.19.26351237 medRxiv

Top 0.2%

3.5%

Show abstract

BackgroundDrowning remains a major global public health challenge. This study examined whether the timing and trajectories of urbanisation--beyond the current built environment--are associated with subnational drowning mortality. MethodsWe linked satellite-derived measures of built-environment change (GHSL), population crowding (WorldPop), surface water exposure (JRC Global Surface Water), and infrastructure proxies (VIIRS/DMSP nighttime lights) to GBD 2021 drowning mortality estimates across 203 ADM1 regions in 12 countries (2006-2021; 3,248 region-year observations). Temporal predictors captured recent expansion, development "newness" ([≤]10-year built share), acceleration/volatility, and a crowdingxgrowth interaction. We screened predictors using LASSO (10-fold cross-validation) and fitted mixed-effects models with region random intercepts. Distributed-lag models tested temporal precedence and development age, and income-stratified models assessed heterogeneity. ResultsAdding temporal predictors improved fit beyond contemporaneous built-environment measures ({Delta}AIC=177; {Delta}BIC=147). In adjusted models, crowdingxgrowth was strongly positively associated with drowning mortality, and a higher share of recent development was associated with higher mortality. Lag models showed a development age gradient: older built environment was most protective. Associations differed by income group, with several key coefficients reversing sign across strata. DiscussionDrowning mortality appears shaped by development histories as well as present-day conditions, with risk concentrated in rapidly changing, dense settings and the newest built environments. Cross-context heterogeneity suggests mechanisms and prevention priorities are unlikely to be uniform. ConclusionsDevelopment timing and trajectories help explain subnational drowning mortality beyond current built form alone. Prevention and planning should prioritise transition-period safety strategies in newly developing and rapidly densifying areas.

15

Design and preliminary safety validation of a hybrid deterministic-AI triage system for multilingual primary healthcare: a WhatsApp-based vignette study in South Africa

Nkosi-Mjadu, B. E.

2026-04-22 health informatics 10.64898/2026.04.21.26349781 medRxiv

Top 0.2%

3.2%

Show abstract

BackgroundSouth Africas public healthcare system serves most of the population through approximately 3,900 primary healthcare clinics characterised by long waiting times and high volumes of repeat-prescription visits. No published pre-arrival digital triage system operates across all 11 official South African languages while aligning with the South African Triage Scale (SATS). This paper reports the design and preliminary safety validation of BIZUSIZO, a hybrid deterministic-AI WhatsApp triage system. MethodsBIZUSIZO delivers SATS-aligned triage via WhatsApp, combining AI-assisted free-text classification (Claude Haiku 4.5) with a Deterministic Clinical Safety Layer (DCSL) that overrides AI output for 53 clinical discriminator categories (14 RED, 19 ORANGE, 20 YELLOW) coded in all 11 official languages and independent of AI availability. A five-domain risk factor assessment can only upgrade triage level. One hundred and twenty clinical vignettes in patient language (English, isiZulu, isiXhosa, Afrikaans; 30 per language) were scored against a developer-assigned gold standard with independent blinded nurse review. A 121-vignette multilingual DCSL safety consistency check across all 11 languages and a 220-call post-hoc framing sensitivity evaluation (110 paired vignettes) were also conducted. ResultsUnder-triage was 3.3% (4/120; 95% CI: 0.9%-8.3%) with no RED under-triage; exact concordance was 80.0% (96/120) and quadratic weighted kappa 0.891 (95% CI: 0.827-0.932). One two-level under-triage was observed on a non-RED presentation (V072, isiXhosa burns vignette, ORANGEGREEN); one two-level over-triage was observed (V054, isiZulu deep laceration, YELLOWRED). In the framing sensitivity evaluation, AI-only classification achieved 50.9% RED invariance under adversarial framing; full-pipeline classification achieved 95.0% in four validated languages, with the DCSL rescuing 18 of 23 AI drift cases. ConclusionsA hybrid deterministic-AI triage system with DCSL-based emergency detection achieved zero RED under-triage and consistent RED detection across all 11 official languages. The 16.7% over-triage rate falls within published South African SATS ranges (13.1-49%). A single two-level under-triage event was observed on an isiXhosa burns vignette (ORANGEGREEN) and is discussed in Limitations. Findings are preliminary; prospective validation against independent nurse triage is the necessary next step.

16

A Systematic Exploration of LLM Behavior for EHR phenotyping

Yamga, E.; Murphy, S.; Despres, P.

2026-04-24 health informatics 10.64898/2026.04.16.26350890 medRxiv

Top 0.2%

3.1%

Show abstract

Background Electronic health record (EHR) phenotyping underpins observational research, cohort discovery, and clinical trial screening. Large language models (LLMs) offer new capabilities for extracting phenotypes from unstructured text, but their performance depends on pipeline design choices-including prompting, text segmentation, and aggregation. No systematic framework has previously examined how these parameters shape accuracy and reproducibility. Methods We evaluated LLM-based phenotyping pipelines using 1,388 discharge summaries across 16 clinical phenotypes. A full factorial experiment with LLaMA-3B, 8B, and 70B systematically varied three pipeline components: prompting (zero-shot, few-shot, chain-of-thought, extract-then-phenotype), chunking (none, naive, document-based), and aggregation (any-positive, two-vote, majority), yielding 24 configurations per model. To compare intrinsic model capabilities, biomedical domain-adapted, commercial frontier (LLaMA-405B, GPT-4o, Gemini Flash 2.0), and reasoning-optimized models (DeepSeek-R1) were evaluated under a fixed configuration. Performance was assessed using precision, recall, and macro-F1; secondary analyses examined prediction consistency (Shannon entropy), self-confidence calibration, and the development of a taxonomy of recurrent model errors. Results Factorial ANOVAs showed that chunking and aggregation were the dominant drivers of performance, whereas the prompting strategy contributed minimally. Configuration effects were stable across model sizes, with no significant Model x Parameter interactions. Phenotype difficulty varied substantially (macro-F1 = 0.40-0.90), yet the highest-performing configuration-whole-document inference without aggregation-was consistent across phenotypes, as confirmed by mixed-effects modeling. In cross-model comparisons, DeepSeek-R1 achieved the highest macro-F1 (0.89), while LLaMA-70B matched GPT-4o and LLaMA-405B at substantially lower cost. Prediction entropy was low overall and driven primarily by phenotype difficulty rather than prompting or temperature. Self-confidence calibration was only moderately informative: high-confidence predictions were more accurate, but larger models exhibited systematic overconfidence. Conclusions LLM performance in EHR phenotyping is governed primarily by input structure and model capacity, not prompt engineering. Simple, document-level inference yields robust performance across diverse phenotypes, providing practical design guidance for LLM-based cohort identification while underscoring the continued need for human oversight for challenging phenotypes.

17

Medicalbench: Evaluating Large Language Models Towards Improved Medical Concept Extraction

Yang, Z.; Lyng, G. D.; Batra, S. S.; Tillman, R. E.

2026-04-16 health informatics 10.64898/2026.04.12.26350704 medRxiv

Top 0.2%

3.0%

Show abstract

Medical concept extraction from electronic health records underpins many downstream applications, yet remains challenging because medically meaningful concepts, such as diagnoses, are frequently implied rather than explicitly stated in medical narratives. Existing benchmarks with human-annotated evidence spans underscore the importance of grounding extracted concepts in medical text. However, they predominantly focus on explicitly stated concepts and provide limited coverage of cases in which medically relevant concepts must be inferred. We present MedicalBench, a new benchmark for medical concept extraction with evidence grounding that evaluates implicit medical reasoning. MedicalBench formulates medical concept extraction as a verification task over medical note concept pairs, coupled with sentence level evidence identification. Built from MIMIC-IV discharge summaries and human verified ICD-10 codes, the dataset is curated through a multi stage large language model (LLM) triage pipeline followed by medical annotation and expert review. It deliberately includes implicit positives, semantically confusable negatives, and cases where LLM judgments disagree with medical expert assessments. Annotators provide sentence level evidence spans and concise medical rationales. The final dataset contains 823 high quality examples. We define two complementary evaluation tasks: (1) medical concept extraction and (2) sentence level evidence retrieval, enabling assessment of both correctness and interpretability. Benchmarking state-of-the-art LLMs and a supervised baseline reveals that performance remains modest, highlighting the difficulty of extracting implicitly expressed concepts. We further show that explicitly incorporating reasoning cues and prompting to extract implicit evidence substantially improves medical concept extractions, while performance is largely invariant to note length, indicating that MedicalBench isolates reasoning difficulty rather than superficial confounders. MedicalBench provides the first systematic benchmark for implicit, evidence-grounded medical concept extraction, offering a foundation for developing medical language models that can both identify medically relevant concepts and justify their predictions in a transparent and medically faithful manner.

18

Perioperative Mortality Prediction Using a Bayesian Ensemble with Prevalence-Adaptive Gating

Pandey, A. K.

2026-04-06 health informatics 10.64898/2026.04.03.26350114 medRxiv

Top 0.2%

2.8%

Show abstract

Background: Perioperative mortality prediction in resource-limited surgical settings remains challenging due to class imbalance, missing data, and the heterogeneity of postoperative complications. Existing risk scores such as POSSUM depend on intraoperative variables and do not quantify prediction uncertainty. Methods: We developed a prevalence-adaptive Bayesian ensemble comprising three stochastic models: classifier Variational Autoencoder (VAE, AUC=0.95), a Flipout Last Layer network (AUC=0.84), and a Monte Carlo Dropout network (AUC=0.80), trained on 697 patients (39 deaths, prevalence 5.59%) with 67 preoperative and postoperative features. Class imbalance (16.9:1) was addressed through Variational Autoencoder augmentation: two class-conditional generative VAEs produced 619 synthetic survivor and 619 synthetic death records, yielding a balanced training corpus of 1,935 samples. VAE augmentation was selected over SMOTE and random oversampling after a comparative study (F1: random oversampling 0.61 vs VAE augmentation 0.77). Validation used a held-out set of 233 patients (13 deaths, 220 survivors). A six-stage prediction pipeline incorporated weighted base risk, a three-path prevalence-adaptive gate, Shannon entropy uncertainty quantification, and rank-transform calibration. Sensitivity analysis was conducted across all six empirically derived hyperparameters. A whole-cohort death audit evaluated all 52 deaths from the complete 930-patient dataset through the deployed clinical decision support system. Statistical analysis included Kruskal-Wallis testing of entropy across triage groups, Wilson score confidence intervals for performance metrics, and Spearman rank correlation for LIME-SHAP interpretability concordance. Results: On the validation cohort the ensemble achieved complete separation (sensitivity 100%, specificity 100%, Youden J=1.000; TP=13, FP=0, TN=220, FN=0). The whole-cohort death audit identified 36 of 52 deaths (sensitivity 69.2%, 95% CI 55.7%-80.1%; precision 100%, 95% CI 90.4%-100.0%; F1=0.818, bootstrap 95% CI 0.732-0.894). Shannon entropy differed significantly across triage levels (Kruskal-Wallis H(2)=24.212, p<0.001, {epsilon}2=0.453), confirming a monotone gradient SAFE < CRITICAL < GRAY ZONE. All six hyperparameters were invariant across their tested ranges (J=1.000 throughout; Supplementary Tables S1-S2). LIME and SHAP rankings showed statistically significant concordance (Spearman {rho}=0.440, p=0.024; Kendall T=0.357, p=0.011), with 4 of 6 principal mortality determinants shared across both methods. Conclusions: A prevalence-adaptive Bayesian ensemble with entropy-based uncertainty triage achieves zero false positive alerts and clinically meaningful audit sensitivity in perioperative mortality prediction. Complete hyperparameter invariance confirms that reported performance reflects structural properties of the calibration architecture. The 16 missed deaths represent feature-invisible cases beyond current observational feature capacity.

19

HealthFormer: Dual-level time-aware Transformers for irregular electronic health record events

Körösi-Szabo, P.; Kovacs, G.; Csiszarik, A.; Forrai, B.; Laki, J.; Szocska, M.; Kovats, T.

2026-03-27 health informatics 10.64898/2026.03.25.26349262 medRxiv

Top 0.2%

2.7%

Show abstract

Longitudinal electronic health records (EHRs) form irregular event sequences that mix multiple clinical coding systems and care settings. Learning transferable patient representations requires modeling both within-encounter code composition and long-range temporal dependencies. We aim to develop a pretraining framework that preserves event structure and explicitly uses elapsed time, while remaining straightforward to fine-tune for new supervised endpoints without task-specific feature engineering. We propose HealthFormer, a dual-level Transformer for event-centric EHR modeling. An Intra-Event Encoder aggregates heterogeneous domain tokens within each typed clinical event into an event embedding via code-specific embedding modules and attention pooling. Event embeddings are combined with a Date Encoder and a continuous-time attention bias based on attention with linear biases (ALiBI) inside an Inter-Event Encoder. We pretrain on Hungarian national administrative health records from a large-scale nationwide longitudinal cohort (spanning millions of individuals over a decade) using multi-task self-supervision with (i) per-domain masked token prediction (masked language modeling, MLM), (ii) event-type prediction under full-event masking (Event-level MLM), (iii) next-event type prediction, and (iv) time-to-next-event ({Delta}t) regression. Pretraining induces hierarchy-consistent organization in learned diagnosis (ICD-10) embedding geometry conducive to analysis and interpretation. On incident cancer prediction, end-to-end fine-tuning achieves test AUCs of 0.81/0.75/0.73 for colorectal cancer (CRC) and 0.94/0.87/0.84 for prostate cancer across 30/60/90-day horizons on balanced cohorts, outperforming logistic-regression baselines, including time-decayed bag-of-codes. HealthFormer provides an event-centric, time-aware representation that transfers via standard fine-tuning without endpoint-specific designs. Using ICD-10 diagnoses and ATC codes can facilitate adoption beyond Hungary. Learned diagnosis embeddings align with the hierarchy, enabling clinical inspection. Broader benchmarking across endpoints remains needed.

20

Addition of Bupropion or Varenicline to Nicotine Replacement Therapy After Acute Coronary Syndrome: A Propensity-Matched Real-World Analysis

Qadeer, A.; Gohar, N.; Maniyar, P.; Shafi, N.; Juarez, L. M.; Mortada, I.; Pack, Q. R.; Jneid, H.; Gaalema, D. E.

2026-04-23 cardiovascular medicine 10.64898/2026.04.21.26351432 medRxiv

Top 0.2%

2.6%

Show abstract

Introduction: Smoking cessation after acute coronary syndrome (ACS) is a Class I recommendation, yet prescription pharmacotherapy use remains low and its real-world cardiovascular effectiveness when added to nicotine replacement therapy (NRT) is poorly characterized. Methods: We conducted a retrospective cohort study using the TriNetX US Collaborative Network (67 healthcare organizations). Adults hospitalized with ACS who received NRT within one month, serving as a proxy for active smoking status, were identified. Two co-primary propensity-matched (1:1, 50 covariates, caliper 0.10 SD) comparisons evaluated bupropion + NRT and varenicline + NRT individually versus NRT alone; a supportive analysis evaluated combined pharmacotherapy versus NRT alone. All-cause mortality was the primary endpoint. Secondary outcomes included MACE, heart failure exacerbations, major bleeding, TIA/stroke, emergency rehospitalizations, and cardiac rehabilitation utilization, assessed at 6 months and 1 year via Kaplan-Meier analysis. Hazard ratios (HRs) greater than 1.0 indicate higher hazard in the NRT-only group. Results: After matching, the combined analysis comprised 8,574 pairs, the bupropion analysis 4,654 pairs, and the varenicline analysis 2,126 pairs. At 1 year, the combined pharmacotherapy group had significantly lower all-cause mortality (HR 1.26, 95% CI 1.16-1.37), MACE (HR 1.16, 95% CI 1.12-1.21), heart failure exacerbations (HR 1.16, 95% CI 1.08-1.25), major bleeding (HR 1.18, 95% CI 1.08-1.28), and greater cardiac rehabilitation utilization (HR 0.82, 95% CI 0.74-0.92; all p < 0.001). TIA/stroke did not differ significantly. Six-month results were consistent. Both varenicline and bupropion individually showed lower mortality and MACE. A urinary tract infection falsification endpoint showed no between-group differences, supporting matching validity. The pharmacotherapy group had higher rates of new-onset depression, driven predominantly by bupropion recipients. Conclusions: In this propensity-matched real-world analysis, adding prescription smoking cessation pharmacotherapy to NRT after ACS was associated with lower mortality and fewer adverse cardiovascular events, supporting broader integration into post-ACS care pathways.